This is an introductory course on open data science which aims to teach students modern approaches to reproducible research. The main tool in this course is the R programming language, a de-facto standard in statistical research and data-processing applications. Moreover, modern best practices in research suggest sharing not only the final result of a study but also making it easy to reproduce. For studies performed mainly with the R language this can be achieved by using the R Markdown format for reporting. An R Markdown document includes the R source code, explanations and comments in formatted text, as well as the output of the R code. One of the goals of this course is to show students how to utilise these powerful instruments in the best way and to encourage them to apply these techniques in their own research.
My personal expectations from this course are quite high. I hope I would be able to get a bit of fluency in R and apply some of the covered approaches in my research.
The project repository: https://github.com/joewkr/IODS-project
A short summary and things learned
First of all, we load the data set we are going to work with. This is a preprocessed data set which provides information about a survey taken by students of a statistics course; it also includes the grades received after the course.
learning2014 <- read.csv('learning2014.csv')
dim(learning2014)
## [1] 166 7
This data set provides information about 166 students across 7 variables. These include a student's age and gender, general attitude towards statistics, their replies to the survey questions combined into four categories, and the total number of points obtained in the final exam of the course. The structure of the data set and a sample subset can be obtained with the following command:
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
For a quick general overview of the data set we can use the pairs function to generate a simple scatter-plot matrix comparing all the variables against each other.
pairs(learning2014)
The points column is of primary interest for us because it provides information about student grades on the exam. As can be seen from the figure, there is no easy-to-spot dependency between exam points and the other variables, except for attitude. The age column seems to have some interesting features as well, though less pronounced. The majority of students were in their twenties, as will be evident from the estimated distribution function introduced further down in the text. This large group of students received all possible grades, from very low to high. But if we focus on the older students, there seems to be a negative correlation between their age and the points received in the exam.
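This apparent negative correlation for the older students can be checked numerically; the age threshold of 30 used below is an arbitrary illustrative choice, not something fixed by the analysis:

```r
# Correlation between age and exam points restricted to older
# students (the threshold of 30 years is an assumption)
older <- subset(learning2014, age > 30)
cor(older$age, older$points)
```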
To get a clearer overview of the dependencies between different variables it is often useful to complement the scatter plots with a numerical measure, for example the correlation coefficient. Here we calculate correlations between all the numeric variables in the data set:
numeric_columns <- tail(names(learning2014))
cor(learning2014[numeric_columns])
## age attitude deep stra surf points
## age 1.00000000 0.02220071 0.02507804 0.10244409 -0.1414052 -0.09319032
## attitude 0.02220071 1.00000000 0.11024302 0.06174177 -0.1755422 0.43652453
## deep 0.02507804 0.11024302 1.00000000 0.09650255 -0.3238020 -0.01014849
## stra 0.10244409 0.06174177 0.09650255 1.00000000 -0.1609729 0.14612247
## surf -0.14140516 -0.17554218 -0.32380198 -0.16097287 1.0000000 -0.14435642
## points -0.09319032 0.43652453 -0.01014849 0.14612247 -0.1443564 1.00000000
The correlation matrix shows that students who planned not to get too deep into the topic of the course (i.e. the students with a high surf score) tend to receive lower grades in the exam. Surprisingly, the students who indicated in their surveys that they strive for a deeper understanding of the topic (a high score in the deep group) did not show better (or worse) exam grades than other student groups. Apart from the students with a high general attitude towards statistics, only the students who tend to use a systematic and organized approach to studying (a high stra score) were able to get better scores in the final exam.
Another insight into the data set can be obtained from the estimated distributions of the numerical variables:
# Build a density plot for each of the six numeric columns and store
# it in variables p1...p6 (uses the GGally, ggplot2 and gridExtra packages)
i <- 1
for(c in tail(names(learning2014))) {
p <- ggally_densityDiag(learning2014, mapping = aes_string(x=c)) +
theme_bw() +
theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank(), aspect.ratio=0.25)
var_name <- paste('p', i, sep='')
assign(var_name, p)
i <- i + 1
}
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 6)
Here it is clear that the majority of the students who took the survey are around 20 years old and there are not many older students in the group. The distributions of the surf and deep scores show a somewhat opposite picture, with surf tending towards lower values and deep towards higher ones. I guess to some extent this could be caused by the human factor, since the questions in the deep group tend to describe a good student while the questions in the surf group describe a somewhat lazy person. Another interesting observation is that the probability of getting 13…15 points in the exam is lower than for nearby values, which could be caused by some peculiarities of the grading system.
Finally, we can provide an overall summary of the data set, which supports the earlier observations:
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
So, which variables could be used to build a linear model for predicting the exam grade? Obviously, attitude should be included since it shows the highest correlation with the exam points. Both the deep and surf groups seem to be influenced by the human factor, making them less applicable for this modelling exercise, unlike the stra group (which also has the second-highest correlation with the exam points). Therefore, for building the model we take attitude, stra and age, the last because it shows some interesting features for the older students.
model <- lm(points ~ attitude + stra + age, data = learning2014)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## attitude 3.48077 0.56220 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## age -0.08822 0.05302 -1.664 0.0981 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
The summary of the fitted model provides a basic overview of the fit and gives important information about the statistical significance of the assumed dependency between the target and explanatory variables. For each explanatory variable, the significance test gives the probability of obtaining a coefficient estimate at least as extreme as the observed one under the null hypothesis that the true coefficient is zero. As a consequence, when the p-value is high it is less likely that the target variable depends on that explanatory variable. For our model, only the attitude parameter shows a very low p-value of 4.72e-09, which indicates a strong and significant dependency between the exam points and a student's attitude. The other two explanatory variables, stra and age, with p-values of 0.0621 and 0.0981, though still significant assuming the significance level \(\alpha\) = 0.10, show a much weaker connection with the target variable.
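The p-values discussed here can also be extracted programmatically from the fitted model; a small sketch using the `model` object defined above:

```r
# The coefficient table of the summary holds the p-values
# in the column named "Pr(>|t|)"
summary(model)$coefficients[, "Pr(>|t|)"]
```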
The obtained linear model is written as:
\[
\text{points} = 10.9 + 3.481\cdot\text{attitude} + 1.004\cdot\text{stra} -0.08822\cdot\text{age}
\] shows that with higher attitude and stra scores, students tend to get a higher final exam grade. For the older students, however, the negative age coefficient indicates that they tend to get a lower final exam grade than younger students with the same attitude and stra scores.
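As a concrete illustration, the fitted coefficients can be applied by hand; the student below (attitude 3.5, stra 3, age 22) is an invented example, and the coefficient values are copied from the model summary above:

```r
# Coefficients as reported by summary(model)
b0 <- 10.89543; b_attitude <- 3.48077; b_stra <- 1.00371; b_age <- -0.08822

# Predicted exam points for a hypothetical student with
# attitude = 3.5, stra = 3 and age = 22
pred <- b0 + b_attitude * 3.5 + b_stra * 3 + b_age * 22
round(pred, 1)  # 24.1 points
```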
Another important parameter provided in the summary of our model is the multiple R-squared value, which shows how well our model fits the data. The low value of the multiple R-squared parameter indicates that only about 20% of the variation in the exam grades can be explained by the selected set of explanatory variables. However, this parameter does not indicate whether the selected set of explanatory variables is optimal or not.
To estimate how well the assumptions made when fitting the linear model hold, we use the following series of diagnostic plots:
par(mfrow=c(3,1))
plot(model, which=c(1,2,5))
Here the ‘Residuals vs Fitted’ plot shows the dependency between the fitted model values and the residuals, calculated as the difference between the observed and fitted values. A sound model should show no clear dependency between the residuals and the fitted values. The produced model seems to have reasonable residuals, which are more or less evenly distributed across the range of fitted values. A worrying point, though, is the cluster of numbered points (35, 56 and 145) with somewhat larger residual values.
The same pattern can be observed in the ‘Normal Q-Q’ plot as well. This plot indicates how well the assumption of normally distributed residuals holds (ideally the points should lie on the 1:1 line). Again, on a seemingly reasonable plot the cluster of numbered points could indicate potential issues with the model.
Finally, the ‘Residuals vs Leverage’ plot shows how much weight is given to individual observations when fitting the model. For a well-defined model this plot should not show outlier points with high leverage values. Our model seems to fit reasonably well under this assumption, though it still shows some numbered points. These points have relatively low leverage values, which makes them a bit less of an issue.
A short summary and things learned
Again, before we start the analysis we load the data set from the file system. This data set provides information about students of two Portuguese schools and their achievements. It contains the received grades (variables G1, G2 and G3) as well as the students' background, including social and school-related information. The original data set, together with a detailed description of its variables, is available at https://archive.ics.uci.edu/ml/datasets/Student+Performance. Although the original data were provided separately for two subjects, Mathematics and Portuguese language, in the current study a preprocessed data set is used, obtained by merging the two original data sets.
The variables in the preprocessed data set can be shown by applying the names function to the loaded data frame:
alc <- read.csv('alc.csv')
dim(alc)
## [1] 382 35
names(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
As can be seen from the output, this data set introduces two additional variables, namely alc_use, which is a simple mean of Dalc and Walc, and high_use, which indicates relatively high alcohol consumption by a student and is defined to be TRUE when alc_use takes values greater than the threshold value of 2.
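The preprocessing step that created these two columns presumably looked something like the following dplyr sketch; the actual script that produced alc.csv is not shown here, so this is a hypothetical reconstruction:

```r
library(dplyr)

# Hypothetical reconstruction of the derived columns
alc_check <- alc %>%
  mutate(alc_use2 = (Dalc + Walc) / 2,
         high_use2 = alc_use2 > 2)

# If the reconstruction is right, these match the shipped columns
all(alc_check$alc_use2 == alc_check$alc_use,
    alc_check$high_use2 == alc_check$high_use)
```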
Since the data set provides quite extensive information about the students, it should be possible to build a statistical model predicting high alcohol consumption based on these data. For this study I choose the following four variables:
selected_columns <- c('famrel', 'studytime', 'absences', 'goout')
where famrel describes the quality of a student's family relationships (with higher values corresponding to better relationships), studytime describes how much time a student spends on their studies in a week, absences describes the number of times a student was absent from school, and goout describes how often a student goes out with friends.
I expect these variables to be connected to alcohol consumption, since it would be natural to assume that a person with bad family relationships, who does not spend too much time studying, is often absent from school and frequently goes out with friends, would show higher alcohol consumption.
To get a better overview of the selected variables and to study their possible connection to alcohol consumption, a graphical summary can be of great help.
alc %>%
pivot_longer(all_of(selected_columns), names_to='key', values_to='value') %>%
ggplot(aes(value, fill=high_use)) +
facet_wrap("key", scales = "free") +
geom_bar() +
theme_bw() +
theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())
The bar plots of the selected variables show no striking features supporting the initial hypothesis, although goout tends to have higher values for students with high alcohol consumption. Surprisingly, absences does not show distinct peaks of high absence for students with high alcohol consumption but rather features a gradual increase of the drinkers' fraction with an increasing number of days absent. For the studytime variable, the fraction of students with high alcohol consumption greatly diminishes with an increasing number of hours spent on studies. Finally, famrel shows a somewhat similar picture, with relatively fewer drinking students in families with good relationships.
This information can also be presented in tabular form, which supports the conclusions drawn from the graphs:
for(var in selected_columns) {
print(table(alc[c('high_use', var)]))
}
## famrel
## high_use 1 2 3 4 5
## FALSE 6 10 39 135 78
## TRUE 2 9 25 54 24
## studytime
## high_use 1 2 3 4
## FALSE 58 135 52 23
## TRUE 42 60 8 4
## absences
## high_use 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17 18 19 20 21 26 27
## FALSE 52 38 42 33 24 16 16 9 14 6 5 2 4 1 1 0 0 1 0 2 1 0 0
## TRUE 13 13 16 8 12 6 5 3 6 6 2 4 4 1 6 1 1 1 1 0 1 1 1
## absences
## high_use 29 44 45
## FALSE 0 0 1
## TRUE 1 1 0
## goout
## high_use 1 2 3 4 5
## FALSE 19 84 103 41 21
## TRUE 3 16 23 40 32
To predict high alcohol consumption based on the selected variables, a logistic model can be applied:
f_string <- paste('high_use ~ ', paste(selected_columns, collapse = ' + '))
model <- glm(as.formula(f_string), data = alc, family = "binomial")
summary(model)
##
## Call:
## glm(formula = as.formula(f_string), family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8701 -0.7738 -0.5019 0.8042 2.5416
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.28606 0.70957 -1.812 0.06992 .
## famrel -0.33699 0.13681 -2.463 0.01377 *
## studytime -0.55089 0.16789 -3.281 0.00103 **
## absences 0.06753 0.02175 3.104 0.00191 **
## goout 0.75953 0.12041 6.308 2.82e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 384.07 on 377 degrees of freedom
## AIC: 394.07
##
## Number of Fisher Scoring iterations: 4
The summary of the trained model indicates that all the selected predictors have statistically significant relations with the target variable high_use at the significance level \(\alpha = 0.05\). The coefficients of the model suggest that absences has the weakest connection with the target variable, whereas the other selected explanatory variables show a stronger link.
A bit more insight into the characteristics of the obtained model can be gained by studying the coefficients in the form of odds ratios and providing their confidence intervals:
odds_ratio <- model %>% coef %>% exp
confidence <- model %>% confint %>% exp
## Waiting for profiling to be done...
cbind(odds_ratio, confidence)
## odds_ratio 2.5 % 97.5 %
## (Intercept) 0.2763573 0.06723596 1.0961732
## famrel 0.7139151 0.54460646 0.9331198
## studytime 0.5764343 0.41040872 0.7941804
## absences 1.0698631 1.02591583 1.1187950
## goout 2.1372720 1.69853389 2.7261579
This table shows that for famrel and studytime the odds ratios are below one (namely, 0.714 and 0.576), which means that increasing these parameters decreases the odds of a student being a drinker. On the contrary, the odds ratios for absences and goout are above one, so increasing these parameters increases the odds of high alcohol consumption by a student. Another important observation from this table is that the confidence intervals for all the odds ratios do not include 1, meaning that these odds ratios are significant. This is also true for the absences variable, which has an odds ratio of 1.07, rather close to 1 but still significant.
To investigate the predictive power of our model, it can be used to estimate the probability of high alcohol consumption from the same data set that was used to train the model. To do so, the predict function can be used, and if the resulting probability is higher than 0.5 the student is assumed to be a drinker. This information, combined with the actual data on alcohol consumption, allows generating a cross tabulation of predicted vs actual values:
probabilities <- predict(model, type = "response")
alc <- alc %>% mutate(probability = probabilities, prediction = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 242 26
## TRUE 65 49
This table shows an interesting property of the model: it tends to predict low alcohol consumption better than high consumption, i.e. it gives a considerably high number of false negatives. This property of the model is apparent from the following plot:
alc %>%
ggplot(aes(x = probability, y = high_use)) +
geom_violin(trim = TRUE, fill='#DCDCDC') +
geom_vline(xintercept = 0.5) +
xlim(0,1) +
theme_bw() +
theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())
It is evident that for students with low alcohol consumption the model tends to consistently predict a low probability of being a drinker. On the other hand, for drinking students the model yields a somewhat close to uniform distribution of predictions, showing no predictive skill.
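The bias towards false negatives can be quantified with the true positive and true negative rates; the counts below are copied from the cross tabulation above:

```r
# Counts from the predicted vs actual cross tabulation
tn <- 242  # predicted FALSE, actual FALSE
fp <- 26   # predicted TRUE,  actual FALSE
fn <- 65   # predicted FALSE, actual TRUE
tp <- 49   # predicted TRUE,  actual TRUE

sensitivity <- tp / (tp + fn)  # about 0.43: drinkers are often missed
specificity <- tn / (tn + fp)  # about 0.90: non-drinkers are well identified
c(sensitivity = sensitivity, specificity = specificity)
```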
The total proportion of failed predictions (i.e. the training error) can be calculated in the following way, by taking a simple mean of the incorrect predictions:
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2382199
which gives a value of 0.238, meaning that about a quarter of the predictions were incorrect.
Such an approach, which uses the whole data set to train a model and then to test it, though simple and straightforward, gives a biased estimate of model quality because all the data points used for testing are already ‘known’ to the model since they were used for training. A more robust approach takes only a part of the data set for training the model and keeps the rest for testing. This approach is applied in the cv.glm function, which estimates the model prediction error in the presence of unseen observations. Here the data set is divided into 10 chunks and only 9 of them are used for model training, while the last one is used for validation. This procedure is repeated for all combinations of training and validation chunks, returning a K-fold cross-validation prediction error.
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model, K = 10)
cv$delta[1]
## [1] 0.2434555
For the model built in the current exercise the cross-validation procedure gives a prediction error of 0.243, which is higher than the previous estimate because it uses data unknown to the model for validation. This error is lower than the error of the example model given for this course (it uses the formula high_use ~ sex + failures + absences), which was about 0.26.
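For comparison, the reference model could be cross-validated with the same loss function; a sketch (the resulting error varies slightly between runs because the fold assignment is random):

```r
library(boot)  # provides cv.glm

# The example model formula given for the course
ref_model <- glm(high_use ~ sex + failures + absences,
                 data = alc, family = "binomial")
ref_cv <- cv.glm(data = alc, cost = loss_func, glmfit = ref_model, K = 10)
ref_cv$delta[1]  # prediction error, about 0.26
```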
But given the relatively large number of potential explanatory variables, it is not a straightforward task to select the best set for building a model. A naïve solution would be to test all possible combinations of the explanatory variables and take the model which gives the lowest prediction error. The problem is that for a data set with 34 free parameters the number of possible combinations of explanatory variables is:
\[
C = \sum_{r=1}^{34} \frac{34!}{r!\,(34 - r)!} = 2^{34} - 1 \approx 1.7179869\times 10^{10}
\] which renders this idea rather impractical. However, to assess the properties of models built with different explanatory variables, using only a minor fraction of all possible combinations should suffice. Therefore, to investigate models built with various combinations of different numbers of explanatory variables, the following approach is applied. For each size of the explanatory vector, a random selection of the explanatory variables is used to train a model and obtain the model training and prediction errors. This procedure is repeated N times for each explanatory vector size to get multiple model realizations. For each model, the model formula, the size of the explanatory vector, and the prediction and training errors are stored in the final data frame for the following analysis.
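The combination count given above can be verified directly in R before setting up the sampling loop:

```r
# Number of non-empty subsets of 34 candidate predictors
n_combinations <- sum(choose(34, 1:34))
n_combinations == 2^34 - 1  # TRUE, i.e. about 1.7e10 combinations
```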
possible_predictors <- head(names(alc), n = -3)
possible_predictors <- possible_predictors[! possible_predictors %in% c('Walc', 'Dalc', 'alc_use')]
mdf <- data.frame(numeric(0), logical(0), numeric(0), character(0), stringsAsFactors = FALSE)
names(mdf) <- c('size', 'training', 'error', 'model')
num_possible_predictors <- length(possible_predictors)
for(i in 30:1) {
# Do not test all the combinations for models with more than
# three predictors, it takes way too much time.
if(i <= 3) {
selected_predictors <- gtools::combinations(
n=num_possible_predictors
, r=i
, v=possible_predictors
, repeats.allowed=FALSE)
} else {
selected_predictors <- t(replicate(400, sample(possible_predictors, size = i)))
}
for(r in 1:nrow(selected_predictors)) {
predictors <- selected_predictors[r,]
f_string <- paste('high_use ~ ', paste(predictors, collapse = '+'))
m <- tryCatch(
glm(as.formula(f_string), data = alc, family = "binomial"),
error=function(e){return(NULL)},
warning=function(w) {return(NULL)})
if(!is.null(m)) {
# Calculate training error
alc_loc <- data.frame(alc)
p <- predict(m, type = "response")
alc_loc <- alc_loc %>% mutate(probability = p, prediction = probability > 0.5)
loss <- loss_func(class = alc_loc$high_use, prob = alc_loc$probability)
cv <- cv.glm(data = alc_loc, cost = loss_func, glmfit = m, K = 10)
# Store training and testing errors
mdf[nrow(mdf) + 1,] <- list(i, TRUE, loss, f_string)
mdf[nrow(mdf) + 1,] <- list(i, FALSE, cv$delta[1], f_string)
}
}
}
As can be seen from the summary figure, the model errors tend to become smaller with an increasing size of the explanatory vector (here training indicates training errors if TRUE and prediction errors if FALSE). This could be attributed to the non-exhaustive nature of the selected approach, which makes it easier to draw a set of ‘bad’ variables if the explanatory vector is short. On the other hand, for long explanatory vectors there is a higher probability of getting ‘good’ variables from a random draw.
mdf$size <- as.factor(mdf$size)
mdf %>%
ggplot(aes(x = size, y = error, col=training)) +
geom_boxplot() +
theme_bw() +
theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())
From all the tested models, the following give the lowest prediction error:
mdf[mdf$training == FALSE,] %>% slice_min(error)
## size training error
## 1 13 FALSE 0.2041885
## 2 9 FALSE 0.2041885
## model
## 1 high_use ~ activities+Mjob+internet+studytime+absences+paid+sex+reason+nursery+Fedu+age+goout+higher
## 2 high_use ~ reason+G2+activities+famsup+absences+sex+goout+Medu+nursery
The model with the lowest prediction error (here the model with fewer explanatory variables was taken) shows some improvement in predictive skill, though it still tends to give a considerable number of false negative predictions according to the provided summary:
best_score <- mdf[mdf$training == FALSE,] %>% slice_min(error)
best_formula <- as.formula(best_score$model[2])
better_model <- glm(as.formula(best_formula), data = alc, family = "binomial")
summary(better_model)
##
## Call:
## glm(formula = as.formula(best_formula), family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0255 -0.7642 -0.4861 0.8064 2.7573
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.51951 0.82443 -4.269 1.96e-05 ***
## reasonhome 0.18904 0.32007 0.591 0.554773
## reasonother 1.09637 0.45591 2.405 0.016182 *
## reasonreputation -0.20941 0.35264 -0.594 0.552630
## G2 -0.02247 0.04843 -0.464 0.642723
## activitiesyes -0.47083 0.26752 -1.760 0.078409 .
## famsupyes 0.04418 0.27130 0.163 0.870655
## absences 0.08698 0.02364 3.679 0.000234 ***
## sexM 1.03519 0.27176 3.809 0.000139 ***
## goout 0.76352 0.12690 6.017 1.78e-09 ***
## Medu -0.03218 0.12743 -0.253 0.800646
## nurseryyes -0.45547 0.32090 -1.419 0.155799
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 373.58 on 370 degrees of freedom
## AIC: 397.58
##
## Number of Fisher Scoring iterations: 4
p2 <- predict(better_model, type = "response")
a2 <- data.frame(alc)
a2 <- a2 %>% mutate(probability = p2, prediction = probability > 0.5)
table(high_use = a2$high_use, prediction = a2$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 251 17
## TRUE 57 57
a2 %>%
ggplot(aes(x = probability, y = high_use)) +
geom_violin(trim = TRUE, fill='#DCDCDC') +
geom_vline(xintercept = 0.5) +
xlim(0,1) +
theme_bw() +
theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())
A short summary and things learned
In this exercise we use one of the standard data sets provided with the R language in the MASS package:
data("Boston")
The data set we use is called Boston; it provides information about housing values in the suburbs of Boston, as well as quite an extensive set of additional information:
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
where

crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centers.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: \(1000\cdot(\text{Bk} - 0.63)^2\) where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.

In total, the Boston data set has 506 observations of 14 individual variables.
Given such a considerable number of different variables, it is interesting to investigate their distributions and the relationships between them. First, the estimated distributions for all the numeric variables are computed:
boston_labels <- c(
'crime rate'
, 'fraction of lots > 25000 sq.ft.'
, 'fraction of non-retail business land'
, 'NOx concentration'
, 'number of rooms'
, 'fraction of buildings from before 1940'
, 'distance to Boston employment centres'
, 'radial highway accessibility index'
, 'tax per $10000'
, 'pupil-teacher ratio'
, 'black/white prevalence'
, 'lower status percent'
, 'house median value'
)
i <- 1
for(c in names(Boston)) {
if(c == 'chas') next
p <- ggally_densityDiag(Boston, mapping = aes_string(x=c)) +
xlab(boston_labels[i]) +
theme_bw() +
theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank(), aspect.ratio=0.25)
var_name <- paste('p', i, sep='')
assign(var_name, p)
i <- i + 1
}
grid.arrange(
p1, p2, p3, p4, p5, p6
, p7, p8, p9, p10, p11, p12, p13, nrow = 7, ncol=2)
These distributions provide some insight into the investigated data set. For example, the crime rate is generally low, and most suburbs have either a prevalent black or a prevalent non-black population; towns with mixed population seem to be less common. A considerable fraction of the buildings in the Boston suburbs was built before 1940, there are not many big houses with a lot of rooms, and the median house value does not vary too much from town to town. Another point to notice is that several variables in the Boston data set show a bimodal distribution. These variables are indus (fraction of non-retail business land), rad (radial highway accessibility index), and tax (tax per $10000), and they indicate some spatial irregularity in the Boston area. For example, it seems that there are locations with either good or bad connections to radial highways, and not many locations with intermediate values of the accessibility index.
To get an overview of the relations between the different variables, the correlation coefficients can be studied. With a considerable number of individual variables, presenting the correlations in graphical form gives a clearer picture.
cor_matrix <- cor(Boston)
cor_matrix <- cor_matrix %>% round(2)
corrplot(cor_matrix, method="circle", tl.cex=0.6)
This plot shows that a number of variables in the Boston data set have relatively strong correlations with each other. Some of these correlations follow my expectations; for example, the median price of buildings shows a negative correlation with the crime rate, and areas with a higher fraction of non-retail business also have higher NOx levels. On the other hand, some of the correlations are rather surprising to me; for example, the number of rooms in a house seems to be negatively correlated with the property tax, though this correlation is weak.
The summary of the Boston data set provided by the summary function gives some additional details on the values of each individual variable, supporting the conclusions from the figures.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
One thing to note here is that all the variables have non-zero mean values.
Before further processing, the data set should be standardized: the original Boston data contain variables with different units and scales, which could affect the subsequent analysis. The standardization is performed with the R function scale:
boston_scaled <- scale(Boston) %>% as.data.frame
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Note that after the standardization procedure the mean values of all the variables in the data set are zero. The procedure also scales the standard deviation of every variable to one.
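This could be verified directly with base R; a quick sketch, with a random matrix standing in for the Boston data:

```r
# Verify that scale() centers each column to mean 0 and scales it to
# standard deviation 1. A random matrix stands in for the Boston data.
set.seed(1)
m <- matrix(rnorm(200), nrow = 40, ncol = 5)
m_scaled <- scale(m)

round(colMeans(m_scaled), 10)  # all zeros
apply(m_scaled, 2, sd)         # all ones
```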
One of the tasks for this week's exercise is to use linear discriminant analysis for classifying and predicting the crime rate in Boston suburbs. To do so, the standardized data set needs further processing. First, the crime rate (the crim variable), which is numeric in the original data set, is converted to categorical form. The original variable is split into four categories representing low, medium low, medium high and high crime rates according to the quantiles of crim in the data set:
bins <- quantile(boston_scaled$crim)
labels <- c('low', 'med_low', 'med_high', 'high')
crime <- cut(boston_scaled$crim, breaks = bins, labels = labels, include.lowest = TRUE)
boston_scaled <- select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
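Since the breaks come from the quantiles, the four categories should end up with roughly equal numbers of observations. A self-contained sketch of the same quantile/cut pattern on synthetic data:

```r
# Quantile-based breaks split a continuous variable into categories of
# roughly equal size (ties at break points may shift counts slightly).
# Synthetic data stands in for the scaled crim variable here.
set.seed(42)
x <- rexp(506)  # same number of observations as in the Boston data
bins <- quantile(x)
cats <- cut(x, breaks = bins,
            labels = c('low', 'med_low', 'med_high', 'high'),
            include.lowest = TRUE)
table(cats)  # about 126-127 observations per category
```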
Next, to be able to examine the predictive skill of the model, the full data set is split into two parts: a training set consisting of 80% of the original data and a test set containing the remaining 20%.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
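Note that sample draws a random subset, so the split (and all the results below) changes every time the document is knitted. Setting the RNG seed before sampling would make the analysis reproducible; a minimal sketch:

```r
# Fixing the RNG seed before sampling makes the 80/20 split reproducible
# between knits. The row count matches the Boston data.
set.seed(2017)
n <- 506
ind <- sample(n, size = floor(n * 0.8))
length(ind)  # 404 rows go to the training set
```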
Now an LDA model could be trained with the crime rate as a target variable:
lda.fit <- lda(crime ~ ., data = train)
ggplot.lda.arrows <- function(p, x, myscale = 1, arrow_heads = 0.1, color = "red", choices = c(1, 2)) {
  # Add an arrow and a label for each LDA coefficient vector to a ggplot object.
  heads <- coef(x)
  for (i in seq_len(nrow(heads))) {
    p <- p + geom_segment(
      x = 0
      , y = 0
      , xend = myscale * heads[i, choices[1]]
      , yend = myscale * heads[i, choices[2]]
      , color = color
      , arrow = arrow(length = unit(arrow_heads, 'cm')))
    p <- p + annotate(
      'text'
      , x = myscale * (heads[i, choices[1]] + 0.1)
      , y = myscale * (heads[i, choices[2]] + 0.1)
      , label = row.names(heads)[i]
      , col = color)
  }
  return(p)
}
lda.data <- data.frame(crime = train$crime, lda = predict(lda.fit)$x)
p <- ggplot(lda.data) +
geom_point(aes(lda.LD1, lda.LD2, color = train$crime), alpha=0.5, size = 2.5)
p <- ggplot.lda.arrows(p, lda.fit, myscale = 2) +
theme_bw() +
theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())
p
The LDA plot shows a well separated group of points corresponding to the high crime rate. It seems that the
rad variable is the main contributor there, and because rad corresponds to the accessibility index for the radial highways, this means that areas with high accessibility tend to show higher crime rates. The other crime-rate categories form a somewhat merged group, although the medium high rate could still be distinguished from the low rate.
The isolated location of the high-crime cluster indicates that the trained model should have considerable predictive skill. To investigate this, the testing part of the data set is used to make model predictions and compare them with the actual values:
correct_classes <- test$crime
test <- select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 11 1 0
## med_low 4 19 4 0
## med_high 0 9 18 2
## high 0 0 0 22
The cross tabulation shows that for the high and medium high crime rates the LDA classifier indeed has considerable predictive skill, with few false negative predictions. For the medium low and low crime rates the model shows less skill.
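The overall accuracy could be computed as the share of the diagonal of this table; a small sketch using the counts shown above (they vary between knits because the train/test split is random):

```r
# Overall accuracy = correctly classified / total, i.e. the share of
# the diagonal of the confusion matrix. Counts are taken from the
# cross tabulation above; they change between runs of the random split.
conf <- matrix(c(12, 11,  1,  0,
                  4, 19,  4,  0,
                  0,  9, 18,  2,
                  0,  0,  0, 22),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c('low', 'med_low', 'med_high', 'high'),
                               predicted = c('low', 'med_low', 'med_high', 'high')))
sum(diag(conf)) / sum(conf)  # about 0.696
```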
So far I used the available information about the crime rates to build an LDA model and predict the crime rates of the test set. But clustering information is not always available in a data set. When such information is missing, a clustering algorithm could be applied to find potentially distinct groups. To illustrate such an approach, the full Boston data set is used for finding separate clusters in it.
Again, as a first step the Boston data set is standardized to remove the different units of measure and normalize the distances:
data('Boston')
boston_scaled <- Boston %>% scale %>% as.data.frame
In this part the k-means algorithm will be applied; it relies on distances between individual observations and cluster centers. To show the distances between the individual points of the Boston data set, it could be passed to the dist function:
dist_eucl <- dist(boston_scaled)
summary(dist_eucl)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
It could be seen that the standardized distances are rather short and are not affected by the magnitude of the values in the original data set.
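By default dist computes Euclidean distances; the method argument selects other metrics. A self-contained sketch on the built-in mtcars data, comparing Euclidean and Manhattan distances:

```r
# dist() supports several metrics via the method argument. For any pair
# of points the Manhattan (L1) distance is at least as large as the
# Euclidean (L2) distance, which can be checked elementwise.
d_eucl <- dist(scale(mtcars), method = 'euclidean')
d_manh <- dist(scale(mtcars), method = 'manhattan')
all(d_manh >= d_eucl)  # TRUE
```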
The k-means algorithm requires the number of clusters to be provided by the user. Although an arbitrarily selected number of clusters would often work:
km <- kmeans(boston_scaled, centers = 3)
pairs(Boston, col = km$cluster)
estimating the optimal number of clusters is less trivial. The optimal number of clusters could be estimated by analyzing the total within-cluster sum of squares (WCSS). When the WCSS is plotted against the number of clusters, the optimal number is indicated by a sharp drop in the WCSS value.
In the following example the k-means algorithm is applied 20 times with different numbers of clusters and the resulting WCSS values are plotted:
k_max <- 20
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line') +
geom_point() +
geom_line() +
theme_bw() +
theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())
As could be seen from the figure, the sharpest drop in the WCSS value happens at 2 clusters, which indicates that this would be the optimal number of clusters.
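One caveat: kmeans starts from random initial centers, so both the WCSS curve and the final clustering can vary between runs. Fixing the seed makes a run reproducible, and the nstart argument repeats the algorithm from several random starts and keeps the solution with the lowest WCSS. A sketch on the built-in mtcars data:

```r
# kmeans is sensitive to its random initial centers; nstart runs the
# algorithm several times and keeps the lowest-WCSS solution, which
# makes the clustering much more stable between runs.
set.seed(123)
km_multi <- kmeans(scale(mtcars), centers = 2, nstart = 25)
km_multi$tot.withinss  # total WCSS of the best of 25 random starts
```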
Using this estimated optimal number of clusters, the k-means algorithm gives the following result:
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)
From the summary figure it is not very clear what the physical meaning (if any) behind the two parts of the data set is. Estimated distributions by cluster give some idea of the meaning of the two clusters (here the original data set is used, as it has a clearer physical meaning):
i <- 1
boston_km <- data.frame(Boston, cluster = as.factor(km$cluster))
for (c in names(boston_scaled)) {
  if (c == 'chas') next  # skip the binary dummy variable
  p <- ggally_densityDiag(boston_km, mapping = aes_string(x = c, color = 'cluster'), alpha = 0.5) +
    xlab(boston_labels[i]) +
    theme_bw() +
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), aspect.ratio = 0.25)
  var_name <- paste('p', i, sep = '')
  assign(var_name, p)
  i <- i + 1
}
grid.arrange(
p1, p2, p3, p4, p5, p6
, p7, p8, p9, p10, p11, p12, p13, nrow = 7, ncol=2)
The clusters seem to divide some variables in a logical way: cluster 1 shows higher taxes and older houses, while cluster 2 has a lower highway accessibility index and lower NOx.
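This interpretation can also be checked numerically: cluster-wise means of the original variables show directly which variables separate the two groups. A sketch, re-fitting k-means with a fixed seed, so the exact values may differ from the figure above:

```r
library(MASS)  # provides the Boston data

# Compare cluster-wise means of a few original variables; large
# differences between the two rows mark the variables driving the split.
set.seed(123)
km2 <- kmeans(scale(Boston), centers = 2, nstart = 25)
aggregate(Boston[, c('tax', 'age', 'rad', 'nox')],
          by = list(cluster = km2$cluster), FUN = mean)
```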
Clusters found by the k-means algorithm could be used as a target variable, as shown in the following example. Here the output from k-means with 4 clusters is added to the Boston data set to train an LDA model:
data('Boston')
boston_scaled <- Boston %>% scale %>% as.data.frame
km <- kmeans(boston_scaled, centers = 4)
boston_scaled_km <- data.frame(boston_scaled, cluster=km$cluster)
lda_km.fit <- lda(cluster ~ ., data = boston_scaled_km)
lda_km.data <- data.frame(cluster = as.factor(km$cluster), lda = predict(lda_km.fit)$x)
p <- ggplot(lda_km.data) +
geom_point(aes(lda.LD1, lda.LD2, color = as.factor(km$cluster)), alpha=0.5, size = 2.5)
p <- ggplot.lda.arrows(p, lda_km.fit, myscale = 2) +
theme_bw() +
theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())
p
The summary figure shows that one of the clusters is associated with the chas variable and another with the rad variable, while the other clusters are less pronounced. The rather striking impact of chas could be explained to some extent by the nature of this variable: in the Boston data set chas is a dummy variable taking values of 0 or 1, which breaks the assumptions of LDA.
Another approach to visualizing an LDA model is shown in the following figure:
model_predictors <- select(train, -crime)
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(
x = matrix_product$LD1
, y = matrix_product$LD2
, z = matrix_product$LD3
, type= 'scatter3d'
, color=train$crime
, size=2
, mode='markers')
On this figure each axis corresponds to an individual linear discriminant, and unlike the 2D plot this figure shows more complete information about the fitted LDA.
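How much of the between-group separation each discriminant captures can be checked from the singular values stored in the fitted model: their squares, normalized to sum to one, give the "proportion of trace" that printing an lda fit reports. A sketch on the built-in iris data; the same computation applies to lda.fit:

```r
library(MASS)

# The squared singular values of an LDA fit, normalized to sum to one,
# give the fraction of between-group variance captured by each
# linear discriminant (the 'proportion of trace').
fit <- lda(Species ~ ., data = iris)
round(fit$svd^2 / sum(fit$svd^2), 4)  # LD1 captures about 99% for iris
```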
A similar figure could be obtained by coloring the points not by crime rate but by the clusters found by the k-means algorithm:
km <- kmeans(select(train, -crime), centers = 4)
plot_ly(
x = matrix_product$LD1
, y = matrix_product$LD2
, z = matrix_product$LD3
, type= 'scatter3d'
, color=as.factor(km$cluster)
, size=2
, mode='markers')
Both figures show somewhat similar features, with a distinctive separate group (which corresponds to high crime rates according to the LDA) almost exclusively occupied by a single cluster. The patterns in the main group of points, although showing some similarity, are less obvious.
(more chapters to be added similarly as we proceed with the course!)